Allow configuring collection of statistics during TPC-H benchmarks #3889
Conversation
There are currently only two physical plans that differ when we pass the flag. I'll try to inspect other plans to see what else we can do (or whether there is anything obvious we are missing, except filters, which are in progress in #3868). But I'd expect the majority of the speed-ups here to come from global join ordering (unlike the only-local ordering we have today) with #3843, so no high hopes for a magic bullet just yet 😇
@@ -98,6 +98,10 @@ struct DataFusionBenchmarkOpt {
    /// Path to output directory where JSON summary file should be written to
    #[structopt(parse(from_os_str), short = "o", long = "output")]
    output_path: Option<PathBuf>,

    /// Whether to disable collection of statistics (and cost based optimizations) or not.
    #[structopt(short = "S", long = "disable-statistics")]
It might be good to keep defaults the same as datafusion-cli (statistics disabled)?
I was also thinking about it but then decided against it, since the default is already true. Do you know whether we have any places where we continuously run benchmarks where making the default false would change results? If it wouldn't cause a regression in places where we compare benchmarks across DataFusion commits/versions, I think it should be an easy change.
There is currently no "continuous benchmarking" in place for this. If there were, it would make sense to run the benchmarks both with and without collecting statistics.
Anyway, good that we can enable / disable them 👍
Thanks @isidentical
Thanks for resolving the conflicts @Dandandan 🙏🏻 By the way, just as a question (or maybe something we might be interested in discussing in general): are there any plans/efforts to do continuous benchmarking, to ensure that no regression goes unnoticed? (I was very interested in @andygrove's benchmarking of Ballista/Spark, so maybe we could even do it for both Ballista and DataFusion.)
Which issue does this PR close?
Closes #3888.
Rationale for this change
As described in #3888, it is useful to see a benchmark with/without statistics to determine what sort of effect it will have.
What changes are included in this PR?
A new flag, --disable-statistics, to turn off statistics collection.
Are there any user-facing changes?
No.
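Since the flag is phrased negatively ("disable"), the benchmark code presumably inverts it to obtain the positive "collect statistics" setting that the execution config expects. A minimal, dependency-free sketch of that inversion (the function and argument handling here are illustrative, not the PR's actual structopt-based code):

```rust
/// Illustrative helper: returns whether statistics should be collected,
/// given raw CLI arguments. The real code uses structopt, which derives
/// this boolean from the `--disable-statistics` / `-S` flag automatically.
fn collect_statistics(args: &[&str]) -> bool {
    // The flag disables collection, so its absence means "collect".
    !args
        .iter()
        .any(|a| *a == "--disable-statistics" || *a == "-S")
}

fn main() {
    // No flag: statistics are collected (the default stays unchanged).
    assert!(collect_statistics(&["tpch", "--query", "5"]));
    // Flag present: collection (and cost-based optimizations) is skipped.
    assert!(!collect_statistics(&["tpch", "--disable-statistics"]));
}
```

Keeping the CLI flag negated means the default behavior (statistics enabled) is unchanged for anyone comparing benchmark results across commits, which is the concern raised in the review discussion above.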